bad local minima


Reviews: Genetic-Gated Networks for Deep Reinforcement Learning

Neural Information Processing Systems

The authors propose a new RL framework that combines gradient-free genetic algorithms with gradient-based optimization (policy gradients). The idea is to parameterize an ensemble of actors by using a binary gating mechanism, similar to dropout, between hidden layers. Instead of sampling a new gate pattern at every iteration, as in dropout, each gate is viewed as a gene and the activation pattern as a chromosome. This allows learning the policy with a combination of a genetic algorithm and policy gradients. The authors apply the proposed algorithm to the Atari domain, and the results demonstrate significant improvement over standard algorithms. They also apply their method to continuous control (OpenAI Gym MuJoCo benchmarks), yielding results that are comparable to standard PPO.
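The gate-as-gene idea described in this review can be sketched in a few lines. Everything below (the network shapes, the placeholder fitness function, the mutation rate) is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, W1, W2, gate):
    """One hidden layer whose units are masked by a FIXED binary gate
    ("chromosome"), instead of a freshly sampled dropout mask."""
    h = np.maximum(0.0, x @ W1)   # ReLU hidden layer
    h = h * gate                  # gate pattern is fixed per actor
    return h @ W2

# A tiny population of actors: shared weights + one gate vector per actor.
hidden = 8
W1 = rng.normal(size=(4, hidden))
W2 = rng.normal(size=(hidden, 2))
population = [rng.integers(0, 2, size=hidden) for _ in range(4)]

def fitness(gate):
    # Placeholder fitness; in the paper this would be the actor's return.
    x = rng.normal(size=(16, 4))
    return -np.abs(forward(x, W1, W2, gate)).mean()

def mutate(gate, p=0.1):
    # Flip each gene independently with probability p.
    flip = rng.random(gate.shape) < p
    return np.where(flip, 1 - gate, gate)

# One genetic step (sketch): keep the fittest gate, mutate it to refill
# the population; policy-gradient updates to W1/W2 would run in between.
best = max(population, key=fitness)
population = [best] + [mutate(best) for _ in range(3)]
```

The shared weights can still be trained with policy gradients between generations, which is the hybrid the review describes.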


Reviews: Adding One Neuron Can Eliminate All Bad Local Minima

Neural Information Processing Systems

The main contribution of this work is to prove that by adding a single exponential function (directly) from input to output and adding a mild l_2 regularizer, the slightly modified, highly nonconvex loss function does not have any non-global local minima. Moreover, all of these local minima actually correspond to the global minima of the original, unmodified nonconvex loss. This surprising result, to the best of my knowledge, is new and of genuine interest. This phenomenon is a bit curious and perhaps deserves more elaboration; this, I am afraid, is likely what is going on here (if you drop the separable assumption).


Reviews: Low-rank matrix reconstruction and clustering

Neural Information Processing Systems

Review of "Low-rank matrix reconstruction and clustering". This paper contributes a new algorithm for low-rank matrix reconstruction based on an application of Belief Propagation (BP) message-passing to a Bayesian model of the reconstruction problem. The algorithm, as described in the "Supplementary Material", incorporates two simplifying approximations, based on assuming a large number of rows and columns, respectively, in the input matrix. The algorithm is evaluated in a novel manner against Lloyd's K-means algorithm by formulating clustering as a matrix reconstruction problem. It is also compared against Variational Bayes Matrix Factorization (VBMF), which seems to be the only previous message-passing reconstruction algorithm. Cons: there are some arguments against accepting the paper.


Mildly Overparameterized ReLU Networks Have a Favorable Loss Landscape

Karhadkar, Kedar, Murray, Michael, Tseran, Hanna, Montúfar, Guido

arXiv.org Artificial Intelligence

We study the loss landscape of two-layer mildly overparameterized ReLU neural networks on a generic finite input dataset for the squared error loss. Our approach involves bounding the dimension of the sets of local and global minima using the rank of the Jacobian of the parameterization map. Using results on random binary matrices, we show most activation patterns correspond to parameter regions with no bad differentiable local minima. Furthermore, for one-dimensional input data, we show most activation regions realizable by the network contain a high dimensional set of global minima and no bad local minima. We experimentally confirm these results by finding a phase transition from most regions having full rank to many regions having deficient rank depending on the amount of overparameterization.
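The rank-of-the-Jacobian argument can be made concrete for one fixed activation pattern; this is a hedged sketch with made-up sizes, not the authors' code. Inside a region where the pattern is constant, each ReLU is locally linear, so the Jacobian of the outputs with respect to the parameters has a closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 5, 3, 6            # n data points, input dim d, m hidden ReLU units
X = rng.normal(size=(n, d))
W = rng.normal(size=(m, d))  # hidden-layer weights
v = rng.normal(size=m)       # output weights

# Binary activation pattern of the hidden units on this dataset.
A = (X @ W.T > 0).astype(float)

# Jacobian of f(x_i) = sum_j v_j * relu(w_j . x_i) w.r.t. all parameters,
# valid inside the region where the pattern A stays constant:
#   d f(x_i) / d w_j = v_j * A[i, j] * x_i
#   d f(x_i) / d v_j = A[i, j] * (w_j . x_i)
J_W = np.concatenate([(A[:, [j]] * v[j]) * X for j in range(m)], axis=1)  # (n, m*d)
J_v = A * (X @ W.T)                                                       # (n, m)
J = np.concatenate([J_W, J_v], axis=1)

# Full row rank (= n) is the condition under which the squared-error loss
# has no bad differentiable local minima inside this activation region.
rank = np.linalg.matrix_rank(J)
```

With mild overparameterization (m*d + m well above n), a generic pattern makes `J` full row rank, which is the regime the abstract describes.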


Maximum-and-Concatenation Networks

Xie, Xingyu, Kong, Hao, Wu, Jianlong, Zhang, Wayne, Liu, Guangcan, Lin, Zhouchen

arXiv.org Machine Learning

While successful in many fields, deep neural networks (DNNs) still suffer from some open problems, such as bad local minima and unsatisfactory generalization performance. In this work, we propose a novel architecture called Maximum-and-Concatenation Networks (MCN) to try to eliminate bad local minima and improve generalization ability as well. Remarkably, we prove that MCN has a very nice property; that is, \emph{every local minimum of an $(l+1)$-layer MCN can be better than, or at least as good as, the global minima of the network consisting of its first $l$ layers}. In other words, by increasing the network depth, MCN can autonomously improve the goodness of its local minima; what is more, \emph{it is easy to plug MCN into an existing deep model to make it also have this property}. Finally, under mild conditions, we show that MCN can approximate certain continuous functions arbitrarily well with \emph{high efficiency}; that is, the covering number of MCN is much smaller than that of most existing DNNs such as deep ReLU networks. Based on this, we further provide a tight generalization bound to guarantee the inference ability of MCN when dealing with testing samples.
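The abstract does not spell out the exact MCN layer, so the following is purely illustrative of combining an elementwise maximum with concatenation; the layer form, names, and shapes are guesses, not the paper's definition:

```python
import numpy as np

rng = np.random.default_rng(2)

def mcn_layer(x, Wa, Wb):
    """Hypothetical layer form (see the paper for the actual definition):
    take the elementwise maximum of two linear maps of the input and
    concatenate it with the incoming features, so earlier-layer
    representations are carried forward unchanged."""
    return np.concatenate([np.maximum(x @ Wa, x @ Wb), x], axis=-1)

x = rng.normal(size=(4, 3))
W1a, W1b = rng.normal(size=(3, 5)), rng.normal(size=(3, 5))
h = mcn_layer(x, W1a, W1b)   # shape (4, 5 + 3): new features plus skip copy
```

Carrying the input forward via concatenation is one simple way a deeper network can always do at least as well as its first $l$ layers, which is the flavor of the guarantee quoted above.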


Understanding Global Loss Landscape of One-hidden-layer ReLU Networks, Part 2: Experiments and Analysis

Liu, Bo

arXiv.org Machine Learning

The existence of local minima for one-hidden-layer ReLU networks has been investigated theoretically in [8]. Based on that theory, in this paper we first analyze how large the probability that local minima exist is for 1D Gaussian data, and how it varies over the whole weight space. We show that this probability is very low in most regions. We then design and implement a linear-programming-based approach to judge the existence of genuine local minima, and use it to predict whether bad local minima exist for the MNIST and CIFAR-10 datasets; we find that there are no bad differentiable local minima almost everywhere in weight space once some hidden neurons are activated by samples. These theoretical predictions are verified experimentally by showing that gradient descent is not trapped in the cells from which it starts. We also perform experiments to explore the count and size of differentiable cells in the weight space.
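A linear-programming feasibility check of this flavor can be sketched as follows; this is a simplified stand-in (a single hidden unit and a hypothetical margin encoding), not the authors' method:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n, d = 6, 3
X = rng.normal(size=(n, d))
pattern = rng.integers(0, 2, size=n)   # desired on/off pattern of one hidden unit

# Feasibility LP (sketch): does some weight vector w realize this pattern,
# i.e. does sign(x_i . w) match pattern_i? Active samples get a unit margin
# to rule out the trivial w = 0; the objective is zero (pure feasibility).
A_ub, b_ub = [], []
for i in range(n):
    if pattern[i]:
        A_ub.append(-X[i]); b_ub.append(-1.0)   # x_i . w >= 1
    else:
        A_ub.append(X[i]);  b_ub.append(0.0)    # x_i . w <= 0

res = linprog(c=np.zeros(d), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * d)
realizable = bool(res.success)   # True iff the activation cell is non-empty
```

In the paper's setting one would add the stationarity conditions of a candidate minimum as further linear constraints; infeasibility then certifies that no genuine local minimum lies in that cell.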


Adding One Neuron Can Eliminate All Bad Local Minima

LIANG, SHIYU, Sun, Ruoyu, Lee, Jason D., Srikant, R.

Neural Information Processing Systems

One of the main difficulties in analyzing neural networks is the non-convexity of the loss function, which may have many bad local minima. In this paper, we study the landscape of neural networks for binary classification tasks. Under mild assumptions, we prove that after adding one special neuron with a skip connection to the output, or one special neuron per layer, every local minimum is a global minimum.
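A hedged sketch of the construction: one exponential neuron wired directly from input to output, plus a mild l_2 penalty on its output weight. The placeholder base model, the logistic loss, and all constants below are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(4)

def augmented_loss(theta, a, w, b, X, y, lam=1e-2):
    """Original output f(x; theta) augmented by a * exp(w . x + b),
    a single skip-connected exponential neuron, with an l2 penalty on a.
    Here f is a stand-in linear model; the paper allows general networks."""
    f = X @ theta                    # placeholder for the original network
    g = a * np.exp(X @ w + b)        # the single added neuron
    margins = y * (f + g)            # binary labels y in {-1, +1}
    return np.mean(np.log1p(np.exp(-margins))) + lam * a ** 2

X = rng.normal(size=(10, 3))
y = rng.choice([-1.0, 1.0], size=10)
loss = augmented_loss(rng.normal(size=3), 0.5, rng.normal(size=3), 0.0, X, y)
```

The claim summarized above is that every local minimum of this augmented objective is global, and that at such minima the added neuron's contribution vanishes, recovering a global minimum of the original loss.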


Understanding Global Loss Landscape of One-hidden-layer ReLU Neural Networks

Liu, Bo

arXiv.org Machine Learning

For one-hidden-layer ReLU networks, we show that all local minima are global within each differentiable region, and that these local minima can be unique or continuous, depending on the data, the activation pattern of the hidden neurons, and the network size. We give criteria to identify whether local minima lie inside their defining regions, and if so (we call them genuine differentiable local minima), their locations and loss values. Furthermore, we give necessary and sufficient conditions for the existence of saddle points as well as non-differentiable local minima. Finally, we compute the probability of getting stuck in genuine local minima for Gaussian input data and parallel weight vectors, and show that it is exponentially vanishing when the weights lie in regions where data are not too scarce. This may help explain why gradient-based local search methods usually do not get trapped in local minima when training deep ReLU neural networks.